
Azure Doc Intelligence 0.2 - support paragraphs and tables for multiple models #10431

Closed
wants to merge 5 commits into from

Conversation


@annjawn annjawn commented Sep 10, 2023

This PR introduces enhancements to the Azure Document Intelligence document loader.

  • Uses paragraphs to build full-page text, improving efficiency with fewer iterations. Paragraphs are supported by all models.
  • Supports paragraphs via a split_mode parameter at initialization of DocumentIntelligenceLoader. This defaults to page, in which case the full text of each page is returned. If paragraph is used, Documents are returned chunked by paragraph. Paragraph chunks may be useful for generating embeddings on smaller pieces of text instead of having to split the full page text yet again.
  • Provides table data extraction if the model specified is prebuilt-document, prebuilt-layout, or prebuilt-invoice. This is useful for developers who intend to use tables with Self-query.
  • Introduces a type key in Document metadata to help distinguish page text vs. paragraphs vs. tables, with the values PAGE, PARAGRAPH, TABLE_HEADER, and TABLE_ROW.
  • For tables, provides the headers and rows in CSV format along with the table index, while retaining the page number; this can be used to load a vector DB for self-query. Note: metadata formatting against the Document schema for self-query is still needed, which can be done with the help of the type key (TABLE_HEADER and TABLE_ROW), table_index, and page.

Sample usage

from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient
from langchain.document_loaders.pdf import DocumentIntelligenceLoader

document_analysis_client = DocumentAnalysisClient(endpoint="<endpoint>", credential=AzureKeyCredential("<key>"))

loader = DocumentIntelligenceLoader("./document.pdf",
    client=document_analysis_client,
    model="prebuilt-document",
    split_mode="paragraph"          # optional, defaults to `page`
) 
documents = loader.load()

tables = []
for doc in documents:
  if doc.metadata['type'] in ['PAGE', 'PARAGRAPH']:
    # page text
    print(f"====Page {doc.metadata['page']} {doc.metadata['type']}-text====\n\n")
    print(doc)
    print("\n\n")
  elif doc.metadata['type'] in ['TABLE_HEADER', 'TABLE_ROW']:
    tables.append(doc)

# first table in the document
table1 = [d for d in tables if d.metadata['table_index'] == 0]
# second table in the document
table2 = [d for d in tables if d.metadata['table_index'] == 1]
# third table in the document
table3 = [d for d in tables if d.metadata['table_index'] == 2]
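The TABLE_HEADER and TABLE_ROW documents described above can be reassembled into CSV text for indexing. A minimal sketch, assuming each document's page_content is one CSV line and metadata carries table_index as in the PR description; plain dicts (with hypothetical sample values) stand in here for the Document objects that loader.load() would return:

```python
# Hypothetical stand-ins for the Document objects yielded by the loader;
# real content would come from loader.load().
tables = [
    {"page_content": "Item,Qty,Price",
     "metadata": {"type": "TABLE_HEADER", "table_index": 0, "page": 1}},
    {"page_content": "Widget,2,9.99",
     "metadata": {"type": "TABLE_ROW", "table_index": 0, "page": 1}},
    {"page_content": "Gadget,1,4.50",
     "metadata": {"type": "TABLE_ROW", "table_index": 0, "page": 1}},
]

def table_to_csv(docs, table_index):
    """Join the header and row documents of one table into CSV text."""
    lines = [d["page_content"] for d in docs
             if d["metadata"]["table_index"] == table_index]
    return "\n".join(lines)

print(table_to_csv(tables, 0))
# Item,Qty,Price
# Widget,2,9.99
# Gadget,1,4.50
```

The same table_index filter shown in the sample usage above is all that is needed; the header line comes first because the loader yields TABLE_HEADER before the TABLE_ROW documents.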


@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Sep 10, 2023
@LarsAC
Contributor

LarsAC commented Sep 11, 2023

Great addition, @annjawn. I just made an update for DocumentIntelligence that also rolls up paragraphs under the same SectionHeading, to generate larger documents that are semantically coherent per the structure of the document. Also, from previous experiments it seems that tables and paragraphs from the DI API overlap, so I used the spans to come up with an ordered list of non-overlapping text chunks. Would you mind having a look at https://github.com/LarsAC/langchain/tree/larsac/azure-di?

@annjawn
Author

annjawn commented Sep 11, 2023

Hey @LarsAC, yes, technically LINES/WORDS/PARAGRAPHS will indeed overlap with TABLE. The idea behind including TABLE is to provide a way for people to use it in Self-query. However, we may still want to include it in the text as well; it's probably a matter of looking at type = PAGE | PARAGRAPH if the user is interested in plain text only, or type = TABLE_HEADER & TABLE_ROW if the user is interested in the tables. I also just added PAGE-level and PARAGRAPH-level chunking, selected via split_mode based on which the user prefers.

I will definitely take a look at your updates 👍, though I am thinking we should still provide some flexibility in how the user may want to retrieve the text from the doc.

@LarsAC
Contributor

LarsAC commented Sep 13, 2023

@annjawn Fully agree with the flexibility. I had also added a "switch" parameter to the constructor of the loader in order to let the user control how to parse the text. We could likely add more options in parallel.


def __init__(self, client: Any, model: str):
def __init__(self, client: Any, model: str, split_mode: str):
Collaborator

@baskaryan baskaryan Sep 14, 2023


could we give this a default val, probably "page"? so this isn't a breaking change and default behavior doesn't change too much


Collaborator

but can we have default here as well, in case this object is instantiated directly by a user?

Author

Yes, we can default to "page" here as well, @baskaryan
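The non-breaking signature being discussed might look like the following sketch. Parameter names follow the PR; the class name and validation are illustrative only, not the actual implementation:

```python
class DocumentIntelligenceParser:
    """Illustrative stand-in for the parser class under review."""

    def __init__(self, client, model: str, split_mode: str = "page"):
        # Defaulting split_mode to "page" preserves the existing behavior
        # for callers that don't pass the new parameter.
        if split_mode not in ("page", "paragraph"):
            raise ValueError(
                f"split_mode must be 'page' or 'paragraph', got {split_mode!r}"
            )
        self.client = client
        self.model = model
        self.split_mode = split_mode

# Existing callers that omit split_mode keep the old page-level behavior:
parser = DocumentIntelligenceParser(client=None, model="prebuilt-document")
print(parser.split_mode)  # page
```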


def _generate_docs(self, blob: Blob, result: Any) -> Iterator[Document]:
for p in result.pages:
Collaborator

if split mode is page should we just keep existing logic? is there value in parsing by paragraph and re-assembling pages?

Author

@annjawn annjawn Sep 14, 2023


@baskaryan the idea of providing paragraphs as an option is to do chunking (splitting) as supported by Azure AI's layout capabilities, rather than having to chunk again using, say, a text splitter. This is helpful for generating embeddings of chunks (paragraphs) that retain the semantic consistency of the text. We won't reassemble the paragraphs back into pages if paragraph is used; rather, we keep them the way Doc Intelligence's layout extracts them. If the user specifies page explicitly, or just doesn't pass the parameter at initialization, then page is the default and the entire page text is generated per page. Hope this makes sense.

Collaborator

what i mean is why not do something like

if self.split_mode == "page":
    for p in result.pages:
        ...
elif self.split_mode == "paragraph":
    for p in result.paragraphs:
        ...

to save us having to write logic for reassembling paragraphs into pages in the case that split mode is page

Author

@annjawn annjawn Sep 16, 2023


@baskaryan right, I am actually doing this here. The result object doesn't have each page's full text individually in the pages attribute, as it may seem; we actually construct pages by concatenating paragraphs. The highest-level grouping that Doc Intelligence goes up to is the entire document (all text from all pages concatenated into one), and below that it is per-page paragraphs (then lines, then words). The content attribute of result is the combined text of all pages, so it's easier to assemble each page from paragraphs than to try to split content into individual pages. That assembly (of paragraphs) only happens if self.split_mode == "page". Here's the structure for a better explanation.

[Screenshot: structure of the Document Intelligence result object]
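The page assembly described above can be sketched without the Azure SDK. Here SimpleNamespace objects with hypothetical content and page_number attributes stand in for result.paragraphs (in the real SDK, a paragraph's page number sits on its bounding regions rather than directly on the paragraph):

```python
from collections import defaultdict
from types import SimpleNamespace

# Hypothetical stand-ins for result.paragraphs.
paragraphs = [
    SimpleNamespace(content="Intro text.", page_number=1),
    SimpleNamespace(content="More on page one.", page_number=1),
    SimpleNamespace(content="Second page starts.", page_number=2),
]

def assemble_pages(paragraphs):
    """Concatenate paragraphs into per-page full text (split_mode='page')."""
    pages = defaultdict(list)
    for p in paragraphs:
        pages[p.page_number].append(p.content)
    # Join each page's paragraphs in document order.
    return {num: " ".join(parts) for num, parts in sorted(pages.items())}

print(assemble_pages(paragraphs))
```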

Author

Attaching a sample JSON output from a 2-page document extracted via the prebuilt-read model.

output.json.zip

file_path: str,
client: Any,
model: str = "prebuilt-document",
split_mode: str = "page",
Author

@annjawn annjawn Sep 14, 2023


@baskaryan here's where it's defaulted to page, so it won't introduce any breaking change.

"type": "PAGE",
},
)
yield d
Author

@baskaryan here's the page vs. paragraph logic. If page is used, we collate paragraphs into each individual page's full text and set "type": "PAGE" in the Document metadata. If paragraph is used, we keep the paragraphs as-is and simply yield each one as a Document with "type": "PARAGRAPH".
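That yield branch might look like the following sketch; plain dicts with hypothetical keys stand in for both the paragraph objects and the Document schema:

```python
def generate_docs(paragraphs, split_mode="page"):
    """Yield one document per page, or one document per paragraph."""
    if split_mode == "paragraph":
        # Keep Doc Intelligence's paragraph chunking as-is.
        for p in paragraphs:
            yield {"page_content": p["content"],
                   "metadata": {"page": p["page"], "type": "PARAGRAPH"}}
    else:
        # "page": collate paragraphs into each page's full text.
        pages = {}
        for p in paragraphs:
            pages.setdefault(p["page"], []).append(p["content"])
        for page, parts in sorted(pages.items()):
            yield {"page_content": " ".join(parts),
                   "metadata": {"page": page, "type": "PAGE"}}

paras = [{"content": "A.", "page": 1}, {"content": "B.", "page": 1}]
print(list(generate_docs(paras)))               # one PAGE document
print(list(generate_docs(paras, "paragraph")))  # two PARAGRAPH documents
```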

@zifeiq
Contributor

zifeiq commented Nov 20, 2023

Hi @annjawn, may I ask what the plan for this PR is? Is it going to be updated and merged?

@hwchase17 hwchase17 closed this Jan 30, 2024
@baskaryan baskaryan reopened this Jan 30, 2024
@baskaryan
Collaborator

Apologies for the slow review! The PR has some merge conflicts; happy to re-review if you'd like to resolve them.

@ccurme ccurme added community Related to langchain-community langchain Related to the langchain package labels Jun 18, 2024
@hwchase17 hwchase17 closed this Jul 8, 2024